Video anomaly detection (VAD) is a challenging computer vision task with many practical applications. As anomalies are inherently ambiguous, it is essential for users to understand the reasoning behind a system's decision in order to determine whether the rationale is sound. In this paper, we propose a simple but highly effective method that pushes the boundaries of VAD accuracy and interpretability using attribute-based representations. Our method represents every object by its velocity and pose. The anomaly scores are computed using a density-based approach. Surprisingly, we find that this simple representation is sufficient to achieve state-of-the-art performance on ShanghaiTech, the largest and most complex VAD dataset. Combining our interpretable attribute-based representations with implicit, deep representations yields state-of-the-art performance with $99.1\%$, $93.3\%$, and $85.9\%$ AUROC on Ped2, Avenue, and ShanghaiTech, respectively. Our method is accurate, interpretable, and easy to implement.
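
A minimal sketch of the kind of density-based scoring described above: each object is summarized by an attribute vector (velocity and pose descriptors), and its anomaly score is its mean kNN distance to the normal training objects. The feature extraction itself is assumed to be available; the array shapes and function names below are illustrative, not the paper's exact implementation.

```python
# Minimal sketch: kNN density scoring over per-object attribute features
# (velocity + pose). Feature extraction is assumed; names are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_density_model(train_features: np.ndarray, k: int = 5) -> NearestNeighbors:
    """Fit a kNN index on attribute features of normal training objects."""
    nn = NearestNeighbors(n_neighbors=k)
    nn.fit(train_features)
    return nn

def anomaly_scores(nn: NearestNeighbors, test_features: np.ndarray) -> np.ndarray:
    """Score = mean distance to the k nearest normal training samples."""
    dists, _ = nn.kneighbors(test_features)
    return dists.mean(axis=1)

# Usage with random stand-ins for velocity/pose descriptors of detected objects.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 10))   # attribute vectors of normal objects
test = rng.normal(size=(20, 10))      # attribute vectors from a test frame
model = fit_density_model(train)
scores = anomaly_scores(model, test)  # higher = more anomalous
```
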
Labeling large image datasets with attributes such as facial age or object type is tedious and sometimes infeasible. Supervised machine learning methods provide a highly accurate solution, but require manual labels which are often unavailable. Zero-shot models (e.g., CLIP) do not require manual labels but are not as accurate as supervised ones, particularly when the attribute is numeric. We propose a new approach, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabeled datasets. Our method does not use any annotated images. Instead, we assume a prior over the label distribution in the dataset. We then train an adapter network on top of CLIP under two competing objectives: i) minimal change of predictions from the original CLIP model, and ii) minimal distance between the predicted and prior distributions of labels. Additionally, we present a novel approach for selecting prompts for Vision & Language models using a distributional prior. Our method is effective and presents a significant improvement over the original model. We demonstrate an improvement of 28% in mean absolute error on the UTK age regression task. We also present promising results for classification benchmarks, improving the classification accuracy on the ImageNet dataset by 2.83%, without using any labels.
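
A hedged sketch of the two competing objectives described above: the adapter's predictions are kept close to the frozen zero-shot CLIP predictions (a KL term), while the batch-averaged predicted label distribution is pulled toward the assumed prior. The tensor shapes, the use of KL for both terms, and the weighting are illustrative, not necessarily the paper's exact formulation.

```python
# Sketch of a CLIPPR-style dual objective; details are illustrative assumptions.
import torch
import torch.nn.functional as F

def clippr_loss(adapter_logits, clip_logits, prior, alpha=1.0):
    """adapter_logits, clip_logits: (batch, num_labels); prior: (num_labels,)."""
    # i) stay close to the original zero-shot CLIP predictions
    consistency = F.kl_div(
        F.log_softmax(adapter_logits, dim=-1),
        F.softmax(clip_logits, dim=-1),
        reduction="batchmean",
    )
    # ii) match the average predicted label distribution to the prior
    pred_dist = F.softmax(adapter_logits, dim=-1).mean(dim=0)
    prior_term = F.kl_div(pred_dist.log(), prior, reduction="sum")
    return consistency + alpha * prior_term

# Tiny usage example with random logits and a uniform prior over 5 labels.
logits = torch.randn(8, 5, requires_grad=True)
loss = clippr_loss(logits, torch.randn(8, 5), torch.full((5,), 0.2))
```
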
Anomaly detection methods strive to discover patterns that differ from the norm in a semantic way. This goal is ambiguous, as a data point that differs from the norm in an attribute such as age, race, or gender may be considered anomalous by some operators, while others may consider that attribute irrelevant. Breaking from previous research, we propose a new anomaly detection method that allows operators to exclude an attribute from being considered relevant for anomaly detection. Our method then learns representations that do not contain information about the nuisance attributes. Anomaly scoring is performed using a density-based approach. Importantly, our method does not require specifying the attributes that are relevant for detecting anomalies, which is typically impossible in anomaly detection, but only the attributes to ignore. An empirical study is presented to verify the effectiveness of our method.
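
For illustration only, and not necessarily the paper's technique: one common way to learn a representation that carries no information about a labelled nuisance attribute is an adversarial head trained through a gradient-reversal layer, so the encoder is pushed to hide the attribute while the adversary tries to predict it. The module names and sizes below are assumptions.

```python
# Illustrative nuisance-removal sketch via gradient reversal (an assumption,
# shown only to make the "representation without nuisance information" idea concrete).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reversed gradients push the encoder to hide the attribute

class NuisanceFreeEncoder(nn.Module):
    def __init__(self, in_dim, rep_dim, num_nuisance_values):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU(),
                                     nn.Linear(rep_dim, rep_dim))
        # adversary tries to predict the nuisance attribute from the representation
        self.adversary = nn.Linear(rep_dim, num_nuisance_values)

    def forward(self, x):
        z = self.encoder(x)
        nuisance_logits = self.adversary(GradReverse.apply(z))
        return z, nuisance_logits

model = NuisanceFreeEncoder(in_dim=128, rep_dim=64, num_nuisance_values=4)
z, nuisance_logits = model(torch.randn(16, 128))
```

The resulting representation z could then be scored with a density-based method, as in the kNN sketch shown earlier.
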
Despite significant advances in image anomaly detection and segmentation, few methods use 3D information. We utilize a recently introduced 3D anomaly detection dataset to evaluate whether or not using 3D information is a lost opportunity. First, we present a surprising finding: standard color-only methods outperform all current methods that are explicitly designed to exploit 3D information. This is counter-intuitive, as even a simple inspection of the dataset shows that color-only methods are insufficient for images containing geometric anomalies. This motivates the question: how can anomaly detection methods effectively use 3D information? We investigate a range of shape representations, including hand-crafted and deep-learning-based ones, and demonstrate that rotation invariance plays a leading role in the performance. We uncover a simple 3D-only method that beats all recent approaches while using no deep learning, external pre-training datasets, or color information. As the 3D-only method cannot detect color and texture anomalies, we combine it with color-based features, significantly outperforming the previous state-of-the-art. Our method, dubbed BTF (Back to the Feature), achieves a pixel-wise ROCAUC of 99.3% and PRO of 96.4% on MVTec 3D-AD.
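
A hedged sketch of the geometry branch suggested above: rotation-invariant local descriptors are extracted from the point cloud (FPFH via Open3D is used here as one hand-crafted, rotation-invariant choice; the abstract itself does not name the descriptor) and each point is scored by its kNN distance to descriptors of normal training samples. The combination with color features is omitted, and the radii and neighbor counts are illustrative.

```python
# Geometry-only anomaly scoring sketch: rotation-invariant FPFH descriptors + kNN.
import numpy as np
import open3d as o3d
from sklearn.neighbors import NearestNeighbors

def fpfh_descriptors(points: np.ndarray, radius: float = 0.05) -> np.ndarray:
    """points: (N, 3) point cloud; returns (N, 33) FPFH descriptors."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=radius * 2, max_nn=100))
    return np.asarray(fpfh.data).T

def point_anomaly_scores(train_desc: np.ndarray, test_desc: np.ndarray, k: int = 1):
    """Distance of each test descriptor to its nearest normal training descriptors."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_desc)
    dists, _ = nn.kneighbors(test_desc)
    return dists.mean(axis=1)  # higher = more geometrically anomalous
```
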
Anomaly detection methods identify samples that deviate from the normal behavior of the dataset. They are typically applied to training sets that contain normal data from multiple labeled classes or from a single unlabeled class. Current methods struggle when faced with training data that contain multiple classes but no labels. In this work, we first discover that classifiers learned by self-supervised image clustering methods provide a strong baseline for anomaly detection on unlabeled multi-class datasets. Perhaps surprisingly, we find that initializing clustering methods with pre-trained features does not improve over their self-supervised counterparts. This is due to the phenomenon of catastrophic forgetting. Instead, we propose a two-stage approach. We cluster the images using a self-supervised method and obtain a cluster label for every image. We then use the cluster labels as "pseudo supervision" for out-of-distribution (OOD) methods. Specifically, we fine-tune pre-trained features on the task of classifying images by their cluster labels. We provide an extensive analysis of our method and demonstrate the necessity of the two-stage approach. We evaluate it against state-of-the-art self-supervised and pre-trained methods and show superior performance.
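
A minimal sketch of the two-stage idea: obtain cluster labels on the unlabeled training set (a plain KMeans stands in for the self-supervised clustering method), train a classifier on those pseudo-labels, and use a standard confidence-based OOD score at test time. The clustering algorithm, classifier, and maximum-softmax-probability score are stand-ins, not the paper's exact components.

```python
# Two-stage sketch: pseudo-labels from clustering, then confidence-based OOD scoring.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 64))   # features of unlabeled normal data
test_feats = rng.normal(size=(50, 64))     # features of test data

# Stage 1: pseudo-labels from clustering (stand-in for self-supervised clustering)
pseudo_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(train_feats)

# Stage 2: classifier trained on the pseudo-labels;
# OOD score = 1 - maximum softmax probability (higher = more anomalous)
clf = LogisticRegression(max_iter=1000).fit(train_feats, pseudo_labels)
ood_scores = 1.0 - clf.predict_proba(test_feats).max(axis=1)
```
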
Semantic segmentation is a key computer vision task that has been actively studied for decades. In recent years, supervised methods have reached unprecedented accuracy, but each new class category requires many pixel-level annotations, which are very time-consuming and expensive to obtain. Additionally, the ability of current semantic segmentation networks to handle a large number of categories is limited. This means that images containing rare class categories are unlikely to be segmented well by current methods. In this paper, we propose a novel method for creating semantic segmentation masks for every object, without training a segmentation network or seeing any segmentation masks. Our method takes as input the image-level labels of the class categories present in the image; they can be obtained automatically or manually. We utilize a vision-language embedding model (specifically CLIP) to create a coarse segmentation map for each class using model interpretability methods. We refine the maps using a test-time augmentation technique. The output of this stage provides pixel-level pseudo-labels instead of the manual pixel-level labels required by supervised methods. Given the pseudo-labels, we utilize single-image segmentation techniques to obtain high-quality output segmentation masks. Our method is shown, quantitatively and qualitatively, to outperform methods that use a similar amount of supervision. Our results are particularly remarkable for images containing rare categories.
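
A hedged sketch of the test-time-augmentation refinement step: a per-class relevance map is averaged over augmented views (here only a horizontal flip) and thresholded into pixel-level pseudo-labels. The relevance_map function stands in for a CLIP interpretability method and is a stub here; the augmentation set, threshold, and function names are assumptions.

```python
# Refining a coarse CLIP relevance map into pixel pseudo-labels via TTA (sketch).
import numpy as np

def relevance_map(image: np.ndarray, class_name: str) -> np.ndarray:
    """Stand-in for a CLIP explainability heat-map; returns values in [0, 1]."""
    rng = np.random.default_rng(abs(hash(class_name)) % (2**32))
    return rng.random(image.shape[:2])

def pseudo_label_mask(image: np.ndarray, class_name: str, threshold: float = 0.5):
    # average the relevance map over the original and a flipped view (TTA)
    maps = [relevance_map(image, class_name),
            np.fliplr(relevance_map(np.fliplr(image), class_name))]
    averaged = np.mean(maps, axis=0)
    # threshold into a coarse binary pseudo-label mask for this class
    return (averaged > threshold).astype(np.uint8)

mask = pseudo_label_mask(np.zeros((64, 64, 3)), "dog")
```

A single-image segmentation method would then be trained on such pseudo-labels to produce the final high-quality mask.
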
In this paper, we present DeepSIM, a generative model for conditional image manipulation based on a single image. We find that extensive augmentation is key to enabling single-image training, and incorporate the use of thin-plate splines (TPS) as an effective augmentation. Our network learns to map between a primitive representation of the image and the image itself. The choice of primitive representation affects the ease and expressiveness of the manipulations, and can be automatic (e.g., edges), manual (e.g., segmentation), or hybrid, such as edges on top of segmentations. At manipulation time, our generator allows for making complex image changes by modifying the primitive input representation and mapping it through the network. Our method is shown to achieve remarkable performance on image manipulation tasks.
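
To make the "strong spatial augmentation" idea concrete, here is a simplified, hedged stand-in: a smooth non-rigid warp built by randomly perturbing a coarse control grid and upsampling it into a dense sampling grid. DeepSIM itself uses thin-plate-spline warps; this sketch only conveys the same idea, and the grid size and strength are illustrative.

```python
# Simplified smooth-warp augmentation (stand-in for the TPS warps used by DeepSIM).
import torch
import torch.nn.functional as F

def random_smooth_warp(img: torch.Tensor, grid_size: int = 4, strength: float = 0.1):
    """img: (1, C, H, W). Returns a smoothly warped copy of the image."""
    _, _, h, w = img.shape
    # identity sampling grid with coordinates in [-1, 1]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)            # (1, H, W, 2)
    # coarse random displacements, upsampled into a dense, smooth field
    coarse = (torch.rand(1, 2, grid_size, grid_size) - 0.5) * 2 * strength
    dense = F.interpolate(coarse, size=(h, w), mode="bicubic", align_corners=True)
    grid = identity + dense.permute(0, 2, 3, 1)                      # (1, H, W, 2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

warped = random_smooth_warp(torch.rand(1, 3, 128, 128))
```
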
Deep anomaly detection methods learn representations that separate between normal and anomalous images. Although self-supervised representation learning is commonly used, small dataset sizes limit its effectiveness. It was previously shown that utilizing external, generic datasets (e.g., ImageNet classification) can significantly improve anomaly detection performance. One approach is outlier exposure, which fails when the external datasets do not resemble the anomalies. We take the approach of transferring representations pre-trained on external datasets for anomaly detection. Anomaly detection performance can be significantly improved by fine-tuning the pre-trained representations on the normal training images. In this paper, we first demonstrate and analyze that contrastive learning, the most popular self-supervised learning paradigm, cannot be naively applied to pre-trained features. The reason is that pre-trained feature initialization causes poor conditioning for standard contrastive objectives, resulting in bad optimization dynamics. Based on our analysis, we provide a modified contrastive objective, the Mean-Shifted Contrastive Loss. Our method is highly effective and achieves a new state-of-the-art anomaly detection performance, including $98.6\%$ ROC-AUC on the CIFAR-10 dataset.
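
A hedged sketch of the mean-shifted idea: normalized pre-trained features are re-centered around the mean of the normal training features before a standard contrastive objective is applied between two augmented views. The temperature, shapes, and exact positive/negative construction are illustrative; see the paper for the precise objective.

```python
# Mean-shifted contrastive loss sketch (details are illustrative assumptions).
import torch
import torch.nn.functional as F

def mean_shift(feats: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Normalize features, subtract the training-set center, renormalize."""
    return F.normalize(F.normalize(feats, dim=-1) - center, dim=-1)

def mean_shifted_contrastive_loss(f1, f2, center, tau: float = 0.25):
    """f1, f2: (batch, dim) features of two augmentations of the same images."""
    z1, z2 = mean_shift(f1, center), mean_shift(f2, center)
    z = torch.cat([z1, z2], dim=0)                                   # (2B, dim)
    b = f1.shape[0]
    mask = torch.eye(2 * b, dtype=torch.bool)                        # self-pairs
    sim = (z @ z.t() / tau).masked_fill(mask, float("-inf"))         # cosine sims
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(b)])   # other view
    return F.cross_entropy(sim, targets)

center = F.normalize(torch.randn(512), dim=-1)        # mean of normal features
loss = mean_shifted_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512), center)
```

At test time, a common choice with such representations is to score anomalies by distance (e.g., cosine distance to the center or kNN distance to the training features), as in the density-based sketches above.
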